This notebook describes a recipe for starting to scrape the UK MP register of interests.
[See also: benscott/mp-financial-interests] The register entries for each member contain semi-structured text data, which is to say that there are some recognisable patterns in the text that makes up the register entries.
The entries are made in different sections, and have a form that repeats, ish...
We can use these repeating structures as the basis of a scraper that will extract the information from the page and put it into a form we can work with, such as a spreadsheet or simple database.
The page you see in your web browser is a rendering of a structured HTML document. You can look at the "code" that defines the page using browser developer tools.
In Chrome, you can view the source code of a web page from the View -> Developer -> View Source menu.
Looking at the raw HTML of a page can be confusing, but many browsers have built-in tools to make it easier to inspect the source code of a web page.
In Chrome, you can launch the developer tools from the View -> Developer -> Developer Tools menu option.
In Chrome developer tools, if you click on the arrow / pointer icon in the top left corner of the tools panel, you can use it to highlight areas of the rendered web page; the HTML used to define that block is then highlighted.
One of the tricks to scraping is to try to identify structural elements or patterns in the HTML that identify the things we are interested in and that we can grab hold of and use as the basis for our scrape.
In the MPs' register of interest pages, we notice that the div tag with the id mainTextBlock contains all the elements that describe the register entries. In particular, we also notice that each separate entry is contained within its own <p> tag.
This gives us one strategy for scraping the page:
1. grab the div tag - and its contents - that contains all the separate register entries;
2. work through each <p> tag - and its contents - in turn and try to pull out the member interests.

Some of the register entries have a largely unstructured form. For example, the entry:
13 December 2016, received £2,500 from Hampshire Cricket, The Ageas Bowl, Botley Road, West End, Southampton SO30 3XH, for media and communications training. Hours: 16 hrs including travel and preparation. (Registered 19 December 2016)
is largely free text. We can see some structure in there (a date, followed by an amount, then a name and an address) but we get the feeling that this entry could be made up of arbitrary text.
An entry such as:
Name of donor: VGC Group
Address of donor: Cardinal House, Bury Street, Ruislip HA4 7GD
Amount of donation or nature and value if donation in kind: £1,800 in a successful auction bid at a fundraising dinner for Barnsley East CLP and the office of another MP, the profits from which will be divided equally.
Donor status: company, registration 5741473
(Registered 04 May 2016)
is semi-structured, in that we have structural items of the form attribute: value
where the value term may or may not itself be structured.
For example, the Name of donor
attribute is simply that - a name - which explicitly represents the name of the donor. But the text associated with the Amount of donation or nature and value if donation in kind
is more unstructured. For sure, we can see recognisable things in the description, but the way they are presented is largely as free text. Which is to say, the way it's presented is arbitrary, which makes it harder to extract information from in a reliable way.
What this means is that there is some low hanging fruit in this register that we can extract reasonably reliably (the name of a donor, for example), but there is also information in there that we may have to parse by hand if we want to do it reliably.
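As a taste of why those attribute: value lines are the easy pickings, here's a minimal sketch (using a shortened, made-up entry in the style of the register, and nothing from the scraper we build below) that splits each line of an entry on the first colon:

```python
# Minimal sketch: split each "attribute: value" line of an entry on the first colon.
# The entry text is a shortened, hypothetical example in the style of the register.
entry = """Name of donor: VGC Group
Address of donor: Cardinal House, Bury Street, Ruislip HA4 7GD
Donor status: company, registration 5741473"""

record = {}
for line in entry.split('\n'):
    # partition() splits on the first occurrence only, so commas and
    # any colons inside the value survive intact
    attribute, sep, value = line.partition(':')
    if sep:
        record[attribute.strip()] = value.strip()

print(record['Name of donor'])  # VGC Group
```

Lines without the attribute: value shape, and free text inside a value, are exactly the cases that need the more careful parsing we develop in the rest of this recipe.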
There are many tools available to help you scrape a web page or set of web pages, but I tend to use code because it gives me the most control over the scrape, albeit at the cost of added complexity compared to point and click style applications.
A couple of Python packages that crossed my radar recently provide a relatively easy way in to getting started with Python based scrapers, so we'll use those in this recipe.
kennethreitz/requests-html
The first package is Kenneth Reitz' requests-html, straplined "Pythonic HTML Parsing for Humans™". This package helps us grab an HTML page and extract the text from it.
In [1]:
#https://github.com/kennethreitz/requests-html
#!~/anaconda3/bin/pip install requests_html
r1chardj0n3s/parse
The second package is Richard Jones' parse
package (r1chardj0n3s/parse).
This package provides a set of tools that make it relatively easy to extract the rendered / visually structured information, such as the attributes/values identified in the MPs' register above.
In [2]:
#https://github.com/r1chardj0n3s/parse
#Install the package
#!~/anaconda3/bin/pip install parse
We'll start by working with a single test page: https://publications.parliament.uk/pa/cm/cmregmem/170502/dugher_michael.htm
This was selected from the last register of the 2015 Parliament. For now, I'm just hoping that the register for the current 2017 Parliament has the same structural form!
Disclaimer: the page I picked from the register was picked because it had a range of content structures, not for any reason relating to the member it relates to or the actual content of an entry. Which is to say: nothing is implied by the selection of the test page, etc etc.
In [3]:
#Set the url of the test page
url='https://publications.parliament.uk/pa/cm/cmregmem/170502/dugher_michael.htm'
The first thing to do is get hold of the page HTML. The requests_html package makes this easy:
In [4]:
#Import the package - we only need to do this once
from requests_html import HTMLSession
#Create a session - we only need to do this once
session = HTMLSession()
In [5]:
#Grab the page
r = session.get(url)
To find the p
tags within an HTML block with a given id, such as mainTextBlock
, we can apply the .html.find()
method to the page and get a list of items in return.
In [6]:
ptags = r.html.find('#mainTextBlock > p')
#View the contents of the 7th p tag in the list (the index starts at 0)
ptags[6].text
Out[6]:
Now we can start to extract some information as data using the parse
package.
In [7]:
#Import everything from the package - this is not best practice!
from parse import *
The parse
package encourages you to split a text string into recognisable components. The text we want to extract is wrapped using braces ({}). The contents of the braces may be a name we want to assign to the extracted text, and / or a pattern that describes the text we want to extract "into" that pair of braces.
The expression we need to use has the form:
parse(stringExtractionPattern, stringWeWantToParse)
In [8]:
pattern = '''Name of donor: {name}\nAddress of donor: {addr}\nAmount of donation or nature and value if donation in kind: {txt}\nDonor status: {status}\n(Registered {date})'''
pr = parse(pattern, ptags[6].text)
#View the results of parsing the string
pr
Out[8]:
If the pattern matcher doesn't match the string, nothing is returned.
In [9]:
pr = parse(pattern,"Some arbitrary text that is unlikely to match...")
print('This returns >>', pr, '<<')
We can make use of this in a conditional statement to take an action depending on whether or not we get a response:
In [10]:
if pr:
    print('There was a match')
else:
    print('No match')
We can put the "named" items we have extracted from the text string into their own Python dict.
The python expression used to do this is known as a "list comprehension" (or more specifically in this case, a "dict comprehension"). Essentially what it does is take the contents of one dict
and use them to create another. Don't worry about it: it's voodoo magic...
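If you do want a peek behind the curtain, here's the same trick applied to a plain dict with toy values (nothing to do with the register data):

```python
# A dict comprehension builds a new dict by looping over a collection of keys
original = {'name': 'VGC Group', 'status': 'company', 'notes': None}
keep = ['name', 'status']

# For each key k in keep, copy the corresponding value from original
copied = {k: original[k] for k in keep}
print(copied)  # {'name': 'VGC Group', 'status': 'company'}
```

The comprehension in the next cell does exactly this, with pr.named supplying the keys to copy across.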
In [11]:
pr = parse(pattern, ptags[6].text)
extractedItems = {k:pr[k] for k in pr.named}
#Preview items
extractedItems
Out[11]:
We can mask the complexity of the dict comprehension by creating a function that deploys it for us:
In [12]:
def todict(result):
    #If there's no match, return an empty dict
    if not result: return {}
    #If there is a match, add the named results items to the returned dict
    return {k:result[k] for k in result.named}
In [13]:
todict(pr)
Out[13]:
If the parse expression does not match the string, then it won't return anything.
Looking at the extracted text, we can see that there are some other elements in there that we might be able to extract.
For example, if the status
is a company, it looks like we might be able to extract out the company number:
In [14]:
#View status
extractedItems['status']
Out[14]:
We can try to extract the fact the entity is a company, along with the company number:
In [15]:
parse('{status2}, registration {cn}', extractedItems['status'])
Out[15]:
We could directly create a dict
from that:
In [16]:
todict( parse('{status2}, registration {cn}', extractedItems['status']) )
Out[16]:
Or we could write a new function that adds any newly extracted items to the extractedItems
dict, and does the parsing:
In [17]:
def todict2(pattern, string, extracted=None):
    if extracted is None: extracted = {}
    newextract = todict( parse(pattern, string) )
    #Add the contents of newextract to the original dict and return
    #Note that this updates any dict passed in via the extracted argument and doesn't strictly need to return it
    extracted.update(newextract)
    return extracted
In [18]:
todict2('{status2}, registration {cn}', extractedItems['status'], extractedItems)
extractedItems
Out[18]:
Alternatively, we might write a function to handle the company number extraction and return the company number (and extracted status) as a dict
.
In [19]:
def companynumber(string):
    return todict( parse('{status2}, registration {cn}', string) )
In [20]:
print( companynumber(extractedItems['status']) )
print( companynumber('Arbitrary text.') )
#Check to see if it works with a company number that starts with a leading 0
print( companynumber('company, registration 05741473') )
Let's look at another entry - if we also print it, the end of line (\n) characters will be rendered, making things easier to read:
In [21]:
print(ptags[11].text)
ptags[11].text
Out[21]:
This entry has Date received
and Date accepted
fields that were not in the entry we scraped first. After parsing, they form part of the txt
item:
In [22]:
#If we don't pass a dict in to todict2(), one will be created for us
extracteditems = todict2(pattern, ptags[11].text)
extracteditems
Out[22]:
So let's parse that item and grab the dates.
(The reason we don't add them to the pattern we used earlier is because that pattern would then not match the entries that do not contain the received and accepted dates.)
In [23]:
#Here's what we're going to parse
extracteditems['txt']
Out[23]:
In [24]:
datepattern = '{}\nDate received: {dateRxd}\nDate accepted: {dateAccd}'
#This updates extracteditems
todict2(datepattern, extracteditems['txt'], extracteditems)
extracteditems
Out[24]:
We could also clean the text field a bit by splitting the string on \nDate
fragments and just retaining the first part:
In [25]:
extracteditems['txt'].split('\nDate')
Out[25]:
In [26]:
#The split returns a list of items - just grab the first one with index 0
extracteditems['cleanertxt'] = extracteditems['txt'].split('\nDate')[0]
extracteditems
Out[26]:
One thing we might notice at this point is that we have some dates. These are represented as text strings, but we can also parse them into a "date-timey" computational thing that identifies the date as a date and lets us do datey things to it.
In [27]:
#The dateutil package extends the standard Python datetime module and helps us parse dates
#~/anaconda3/bin/pip install python-dateutil
from dateutil import parser as dtparser
In [28]:
def parsedate(string):
    #This is not best practice - if the parse fails, return None
    try:
        dt = dtparser.parse(string)
    except:
        dt = None
    return dt
In [29]:
parsedate('8 December 2016')
Out[29]:
Having things in datetime
format lets us work with them as such. For example, we can display them in a variety of ways:
In [30]:
print( parsedate('8 December 2016').strftime("%d/%m/%y") )
print( parsedate('8 December 2016').strftime("%B %d, %Y") )
print( parsedate('8 December 2016').strftime("%A, %B %d, %Y") )
print( parsedate('8 December 2016').isoformat() )
There is a reference card for strftime
modifiers / formatters here: http://strftime.org/
We can use the formatter to format our dates for us:
In [31]:
extracteditems['date_f'] = parsedate(extracteditems['date'])
extracteditems
Out[31]:
We can also automate things a bit if date items have 'date' at the start of their name and haven't already been formatted (identified using the _f suffix as part of their name).
Part of the automation requires creating dict attributes, named after date attributes but with the additional _f
suffix as part of the name. We can use a Python string formatter to help us do this:
In [32]:
'{}_f'.format('date')
Out[32]:
In [33]:
def parsedates(record):
    #This looks complicated but what it basically does is look for attributes called date* and not ending _f
    for k in [k for k in record.keys() if k.lower().startswith('date') and not k.lower().endswith('_f') ]:
        #The record is a dict and is mutable - that is, the dict we passed in is changed by the function
        record['{}_f'.format(k)] = parsedate( record[k] )
In [34]:
#Remember - this automatically updates the dict we pass to it
parsedates(extracteditems)
#Show dict updated with parsed dates
extracteditems
Out[34]:
We've now grabbed quite a lot of the low hanging fruit from the page, but what about the money?
Ideally, there should only be one monetary amount specified in an entry (this is not always the case), but for now we'll just make an attempt at grabbing the first. We're also going to assume amounts are converted to, and given as, £ equivalents. To make life easier for the parser, we remove any commas (which is to say, commas used as thousands separators) from the parsed string by replacing them with an empty string.
In [35]:
print( '£1,250,000.00 #loadsamoney'.replace(',','') )
moneystring='£1,250,000.00 #loadsamoney'
print( moneystring.replace(',','') )
In [36]:
#Define a simple helper function
def commaclean(string):
    return string.replace(',','')
In [37]:
#Demo the helper function
commaclean( moneystring )
Out[37]:
Let's test a cash amount detecting pattern to see how it works with different strings.
In [38]:
gbppattern = '{?}£{gbp:g}{}'
def testcashparse(string):
    print('{}: {}'.format(string, todict2(gbppattern, commaclean(string)) ))

testcashparse('I got £5,000, okay?')
testcashparse('£5,000')
testcashparse('£5001')
testcashparse('They paid me, in 2015, £5000')
testcashparse('£1,250,000.00. #loadsamoney')
testcashparse('The sum of £2.75 for a coffee which cost £2.75')
testcashparse('I got £1000. In two lots: £200 and £800')
testcashparse('Gaming the system: I got £1.25 then £50,000 on top')
testcashparse("I got 50000 you won't pick up on")
There are some issues with this, but also some things we can work with:
1. one of the things the parse function returns is an index of where the match took place in the parsed string. So if we check the character before a match, and it's a £, we know we're quids in...
2. rather than the parse() function, let's use the findall() function to see if we can find all the sterling amounts.

When we manage to parse a string, the parser also returns an index that shows where a match took place:
In [39]:
string = 'A sum of £1000'
locator = parse(gbppattern, string)
locator, locator.spans
Out[39]:
We can retrieve a particular character from a string by referencing its index value, starting at the beginning of the string (counting up from index 0) or the end of the string (an increasingly negative count starting with index value -1).
In [40]:
'abc DEF'[0], 'abc DEF'[4], 'abc DEF'[-1]
Out[40]:
So we can find the character before a match string as follows:
In [41]:
ix = locator.spans['gbp'][0]-1
string[ ix ]
Out[41]:
We can handle matches at the very start or end of a string by adding whitespace around the string before we parse it, which also gives us the ability to check the character before an amount for the presence of a currency symbol.
In [42]:
def moneyfudge(string):
    return ' {} '.format( commaclean(string))
In [43]:
moneystring = 'Gaming the system: I got £1.25 then £50,000 on top in 2015'
items=findall(gbppattern, moneyfudge(moneystring))
for i in items:
    print(i, moneyfudge(moneystring)[i.spans['gbp'][0]-1] )
Let's create a function to get the financial amounts from a string identified by a preceding currency symbol such as a £ sign.
Preemptively, we can also build in a check for amounts in other currencies, using a simple dict to map currency symbols onto currency labels.
In [44]:
currencies={'£': 'GBP',
            '$': 'USD',
            '€': 'EUR'}
for unit in ['£', '€', '$']:
    print( currencies[unit] )
Here are the currency symbols we know about:
In [45]:
currencies.keys()
Out[45]:
In [46]:
def getamounts(string, numpattern = '{}{num:g}{}'):
    response = {}
    response['amounts'] = []
    response['currency'] = []
    for amount in findall(numpattern, moneyfudge(string) ):
        currency = moneyfudge(string)[amount.spans['num'][0]-1]
        if currency in currencies.keys():
            response['amounts'].append(amount['num'])
            response['currency'].append(currencies[currency])
    return response
In [47]:
getamounts(moneystring)
Out[47]:
In [48]:
print( getamounts('I got £5,000, okay?') )
print( getamounts('£5,000') )
print( getamounts('£5001') )
print( getamounts('They paid me, in 2015, £5000') )
print( getamounts('£1,250,000.00. #loadsamoney') )
print( getamounts('The sum of £2.75 for a coffee which cost £2.75') )
print( getamounts('I got £1000. In two lots: £200 and £800') )
print( getamounts('Gaming the system: I got £1.25 then £50,000 on top') )
print( getamounts("I got 50000 you won't pick up on") )
Right - so we can pull out amounts, if preceded by a currency symbol. (We could perhaps also catch other numbers that don't look like years around the reporting period into an "other numbers" list for further investigation?) Let's start to think about generating some new items for our extracteditems record. We'll create a function that returns the maximum, summed and itemised amounts that we can use variously if we want to go digging in the data.
We can also try to be clever, and where more than two items are listed, calculate the difference between the total sum and the maximum amount. If these are equal (or if we are being more elaborate, nearly equal) then we might want to check the text to see if the larger amount is specifying the sum of the smaller amounts (in which case, the total sum item is meaningless).
Where there are multiple amounts, we can create a serialised version of the list.
In [49]:
amounts = [100, 250.25, 1000]
#serialise the amounts - making sure they are represented as strings first
'::'.join([str(amount) for amount in amounts])
Out[49]:
When doing the sums, we need to make sure that the sums are calculated on the same currencies. It's easier to do this using a tabular data representation built for doing spreadsheet-like operations. The pandas package is the go-to package for working with tabular datasets, conventionally imported using the name pd
, so let's load it in.
In [50]:
#!~/anaconda3/bin/pip install pandas
import pandas as pd
We can generate a dataframe directly from a list of dict
s. The dict
keys specify the column names, and each list entry represents a row.
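For example, a minimal sketch with a couple of toy records (made-up values, not register data):

```python
import pandas as pd

# Each dict becomes a row; the dict keys become the column names
records = [{'name': 'VGC Group', 'status': 'company'},
           {'name': 'Hampshire Cricket'}]
df_toy = pd.DataFrame(records)

# Keys missing from a record are filled in with NaN
print(df_toy)
```

This is handy for our scraper, because different register entries yield different sets of extracted items, and the dataframe quietly pads out the gaps.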
In [51]:
#Generate a test example dataframe
df = pd.DataFrame( getamounts('I got £1000. In two lots: £200 and £800. And then $500') )
df
Out[51]:
We can also filter the rows to show just the rows where one or more columns have specified values:
In [52]:
df[ df['currency']=='GBP' ]
Out[52]:
We can check whether a dataframe, df, has at least one row by testing whether it is empty or not:
In [53]:
print( df.empty )
print( pd.DataFrame([]).empty )
We can also return a dataframe as a dict
in a variety of orientations:
In [54]:
df.to_dict(orient = 'list')
Out[54]:
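For example (a sketch on a toy dataframe rather than the one above), compare the list and records orientations:

```python
import pandas as pd

# Toy stand-in for the amounts dataframe
df_demo = pd.DataFrame({'amounts': [1000.0, 200.0], 'currency': ['GBP', 'GBP']})

# orient='list' gives one list per column...
as_lists = df_demo.to_dict(orient='list')
# ...while orient='records' gives one dict per row
as_records = df_demo.to_dict(orient='records')
print(as_lists)
print(as_records)
```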
Now let's build a function to generate some amount related measures on our extracted data, by currency.
In [55]:
def getcheckamounts(string, numpattern = '{}{num:g}{?D}'):
    amounts = getamounts(string, numpattern)
    #Generate a dataframe of the amounts
    df_amounts = pd.DataFrame( amounts )
    #If there aren't any amounts, we can bail out now
    if df_amounts.empty: return {}
    response = {}
    #Generate a dataframe containing just the ukp amounts
    df_ukpamounts = df_amounts[df_amounts['currency']=='GBP']
    response['maxamountGBP'] = max(df_ukpamounts['amounts']) if not df_ukpamounts.empty else 0
    response['numamountsGBP'] = len(df_ukpamounts)
    response['sumamountsGBP'] = sum(df_ukpamounts['amounts']) if not df_ukpamounts.empty else 0
    response['sumlessmaxGBP'] = response['sumamountsGBP'] - response['maxamountGBP'] if len(df_ukpamounts['amounts'])>2 else 0
    response['numamountsOther'] = len(df_amounts) - len(df_ukpamounts)
    #Record amounts for all currency items
    response['amounts'] = '::'.join([str(amount) for amount in df_amounts['amounts']])
    response['currencies'] = '::'.join([str(currency) for currency in df_amounts['currency']])
    return response
In [56]:
tests = ['I got £1.25 then £50,000 on top in 2015',
         '£1.25',
         'No money',
         'Four payments, £100, £200 and £400 to give £701 total, and then $5000 more' ]
for test in tests:
    print( test, getcheckamounts(test) , '\n')
This may look overly complex, but it is just a beginning if we are to try to build in logic that will detect amounts submitted, for whatever reason, in arbitrary ways.
If amounts are submitted in other currencies, we can use date information to calculate equivalent sterling amounts using historical exchange rate data.
The forex-python
package provides one way of doing this.
In [57]:
#!~/anaconda3/bin/pip install forex-python
In [58]:
testdate = parsedate('February 10th 2015')
testdate
Out[58]:
In [59]:
from forex_python.converter import CurrencyRates
c = CurrencyRates()
#Convert $100 to UKP using exchange rate as of February, 2015
c.convert('USD', 'GBP', 100, testdate)
Out[59]:
In [60]:
extracteditems['txt']
Out[60]:
In [61]:
# The .update() method updates the contents of the dict directly
extracteditems.update( getcheckamounts( extracteditems['txt']) )
extracteditems
Out[61]:
Looking back at the original page, we notice that the entries are grouped according to different sorts of interest, such as 2. (a) Support linked to an MP but received by a local party organisation or indirectly via a central party organisation or 3. Gifts, benefits and hospitality from UK sources.
We can detect when we enter a section by detecting a paragraph that starts with one of these headings:
In [62]:
for p in ptags:
    #The .startswith() method accepts a tuple (enumeration of things in a pair of brackets) to check against
    #If the text starts with any of the strings in the tuple, the condition evaluates true for that text
    if p.text and p.text.startswith(('1.','2.','3.','4.','5.','6.','7.','8.')):
        print(p.text)
Looking through some of the member pages, there are other elements of structure that we should be able to draw on. For example, many pages have content of the following form in the Employment and earnings section, which we should be able to pull out:
1. Employment and earnings
Payments from ComRes, 4 Millbank, London SW1P 3JA:
26 January 2016, £75 for participating in Parliamentary Panel Survey. Hours: 15 mins. (Registered 03 August 2016)
4 April 2016, £75 for participating in Parliamentary Panel Survey. Hours: 20 mins. (Registered 03 August 2016)
20 May 2016, £100 for participating in Parliamentary Panel Survey. Hours: 15 mins. (Registered 03 August 2016)
Payments from YouGov, 50 Featherstone St, London EC1Y 8RT:
13 Jan 2016, £50 for participating in an online survey. Hours: 15 mins. (Registered 05 December 2016)
4 Mar 2016, £30 for participating in an online survey. Hours: 15 mins. (Registered 05 December 2016)
In [63]:
#Example text
ptags[2].text
Out[63]:
In [64]:
employmentpattern1 = '''{empdate},{txt2} Hours: {hours}(Registered {date})'''
In [65]:
todict2( employmentpattern1, ptags[2].text )
Out[65]:
In [66]:
string = '26 January 2016, £75 for participating in Parliamentary Panel Survey. Hours: 15 mins. (Registered 03 August 2016)'
todict2( employmentpattern1,string )
Out[66]:
In [67]:
employmentsubpattern1 = '''{emptxt} for {empfor} from {empfrom}'''
employmentsubpattern2 = '''{emptxt} from {empfrom} for {empfor}'''
employmentsubpattern3 = '''{emptxt} for {empfor}'''
employmentsubpattern4 = '''{emptxt} from {empfrom}'''
In [68]:
teststrings = ['£75 for participating in Parliamentary Panel Survey',
               '£75 from this or that org for participating in Parliamentary Panel Survey',
               '£75 for participating in Parliamentary Panel Survey from this or that org',
               '£75 from this or that org',
               '£75 from the Society for Whatever for doing a thing',
               '£75 from the Society for Whatever',
               '£75 for doing whatever from the Society for Whatever',
               '£75 from the Society for Whatever and for Whenever for doing whatever ']
for string in teststrings:
    test = todict2( employmentsubpattern1, string)
    if not test:
        test = todict2( employmentsubpattern2, string)
    if not test:
        test = todict2( employmentsubpattern3, string)
    if not test:
        test = todict2( employmentsubpattern4, string)
    print(string, test,'\n')
One of the issues we have here is identifying whether or not a for is part of the name.
If a for is extracted into a "for" attribute, we can perhaps assume that it needs pushing back into the "from" attribute. But things can get messy, as the above examples show.
So this is largely parked for now, other than to bludgeon the handling of a "for in for".
In [69]:
def forcatcher(record):
    #Try to catch the "Society for Whatever"
    #Also things like the Society for This and for That
    #Note the patterns extract into empfor / empfrom, so those are the keys we test
    if 'empfor' in record and 'empfrom' in record and ' for ' in record['empfor']:
        record['empfrom'] = ' for '.join([record['empfrom']] + record['empfor'].split('for')[:-1]).strip()
        record['empfor'] = record['empfor'].split('for')[-1].strip()
        #We can get rid of any double spaces by splitting and rejoining a string on a space
        record['empfrom'] = ' '.join(record['empfrom'].split())
        record['empfor'] = ' '.join(record['empfor'].split())
    return record
In [70]:
def getemploymentsubdetails(string):
    empdetail = todict2( employmentsubpattern1, string)
    if not empdetail:
        empdetail = todict2( employmentsubpattern2, string)
    if not empdetail:
        empdetail = todict2( employmentsubpattern3, string)
    if not empdetail:
        empdetail = todict2( employmentsubpattern4, string)
    return empdetail
In [71]:
for string in teststrings:
    test = getemploymentsubdetails(string)
    print(string, forcatcher(test),'\n')
In [72]:
def getemploymentdetails(string):
    employmentdetails = todict2( employmentpattern1, string )
    #Enrich with further extractions
    if employmentdetails:
        employmentdetails.update( getemploymentsubdetails(employmentdetails['txt2']) )
        employmentdetails.update( getcheckamounts(employmentdetails['txt2']) )
        employmentdetails = forcatcher(employmentdetails)
    return employmentdetails
In [73]:
getemploymentdetails(ptags[2].text)
Out[73]:
Inspection of some other records suggests that sometimes a nested structure might be used to represent payments from the same source:
In [74]:
url2='https://publications.parliament.uk/pa/cm/cmregmem/170502/gray_james.htm'
r2 = session.get(url2)
ptags2 = r2.html.find('#mainTextBlock > p')
In [75]:
for i in range(2,15):
    print(ptags2[i].text)
In [76]:
df = pd.DataFrame()
for i in range(18,24):
    empdetails = getemploymentdetails(ptags2[i].text)
    if empdetails:
        df = pd.concat([df,pd.DataFrame([empdetails])])
    else:
        #Add a row containing unparsed text
        dummy = {'txt':ptags2[i].text}
        df = pd.concat([df,pd.DataFrame([dummy])])
#Reset the index
df.reset_index(drop=True, inplace=True)
df[['txt','txt2']]
Out[76]:
Inspecting the dataframe, we notice that we can "fill down" on the dataframe to generate a string that might provide information about who made a payment.
In [77]:
#Create a new working column
df['txt4'] = df['txt']
#Fill down
df['txt4'] = df['txt4'].fillna(method='ffill')
df['txt4']
Out[77]:
In [78]:
emppaymentsubpattern1 = '''{} from {empfromsub}, {empaddr}'''
emppaymentsubpattern2 = '''{} via {empfromsub}, {empaddr}'''
emppaymentsubpattern3 = '''{} from {empfromsub}'''
emppaymentsubpattern4 = '''{} via {empfromsub}'''
In [79]:
def getemppaymentsubdetails(string):
    #Check that we have a valid string to parse, else return an empty dict
    if pd.isnull(string): return {}
    emppaymentdetail = todict2( emppaymentsubpattern1, string)
    if not emppaymentdetail:
        emppaymentdetail = todict2( emppaymentsubpattern2, string)
    if not emppaymentdetail:
        emppaymentdetail = todict2( emppaymentsubpattern3, string)
    if not emppaymentdetail:
        emppaymentdetail = todict2( emppaymentsubpattern4, string)
    return emppaymentdetail
In [80]:
df['txt4'].apply(getemppaymentsubdetails)
Out[80]:
We can then cast the column that contains the extracted dict
across new columns.
In [81]:
df['txt4'].apply(getemppaymentsubdetails).apply(pd.Series)
Out[81]:
In [82]:
df = df.join( df['txt4'].apply(getemppaymentsubdetails).apply(pd.Series))
df
Out[82]:
We'll be able to make use of this later as a post-processor on a dataframe generated from entries on the whole page.
The section recording visits outside the UK looks as if it may have a regular structure:
4. Visits outside the UK
Name of donor: The Mamont Foundation
Address of donor: c/o Rothschild Trust Guernsey Ltd, PO Box 472, St Julian’s Court, St Julian’s Avenue, St Peter Port GY1 6AX, Guernsey, Channel Islands
Estimate of the probable value (or amount of any donation): transport, accommodation, food and drink at an estimated cost of £1,500
Destination of visit: Arkhangelsk
Date of visit: 28 – 31 March 2017
Purpose of visit: To attend the ‘Arctic: Territory of Dialogue’ International Arctic Forum.
(Registered 20 April 2017)
In [83]:
#Example text
ptags[23].text
Out[83]:
By inspection of unparsed strings, there may be several formats for recording the overseas visits entries. We can capture these using a default hierarchy that tests each format in turn.
In [84]:
#Note the {} fudge to account for Date/Dates
outsideukvisitpattern1 = '''Name of donor: {name}\nAddress of donor: {addr}\nEstimate of the probable value (or amount of any donation): {visitestimate}\nDestination of visit: {visitDest}\nDat{} of visit: {visitdates}\nPurpose of visit: {visitpurpose}\n(Registered {date})'''
outsideukvisitpattern2 = '''Name of donor: {name}\nAddress of donor: {addr}\nAmount of donation (or estimate of the probable value): {visitestimate}\nDestination of visit: {visitDest}\nDat{} of visit: {visitdates}\nPurpose of visit: {visitpurpose}\n(Registered {date})'''
def getoutsideUKvisit(string):
    visit = todict2(outsideukvisitpattern1, string)
    if not visit:
        visit = todict2(outsideukvisitpattern2, string)
    if visit:
        visit.update( getcheckamounts(visit['visitestimate']))
    return visit
In [85]:
for i in [23, 24, 25]:
    string = ptags[i].text
    print(string, getoutsideUKvisit(string), '\n')
In [87]:
section = ''
#iterate through the paragraphs
for p in ptags[:15]:
    #Check to see if we're in a new section. If so, capture the section
    if p.text and p.text.startswith(('1.','2.','3.','4.','5.','6.','7.','8.')):
        section = p.text
    #Do the preliminary parsing of a paragraph
    extracteditems = todict2(pattern, p.text)
    #Identify the section
    extracteditems['section'] = section
    #Look for the data - checking first there's a txt tag that's been extracted...
    if extracteditems and 'txt' in extracteditems:
        #Dates
        todict2(datepattern, extracteditems['txt'], extracteditems)
        #Get a simplified version of the text string, without dates, to potentially make life easier in the future
        extracteditems['cleanertxt'] = extracteditems['txt'].split('\nDate')[0]
        #Dateify any dates
        parsedates(extracteditems)
        #Extract any company numbers that are declared
        extracteditems.update( companynumber(extracteditems['status']) )
        #Money
        extracteditems.update( getcheckamounts( extracteditems['txt']) )
    print(extracteditems)
We can build some optimisation in by moving on to the next paragraph, if we have already completely parsed the current paragraph, using the continue
statement.
In [88]:
for i in ['a','b','c','d']:
    if i == 'b': continue
    print(i)
One thing our parsing has so far omitted to do is capture the paragraphs / entries that we have not been able to parse. Keeping a record of these is important because it shows us where the scraper is falling short. We will build something into our processor to capture these items, grabbing any financial amounts, if we can detect any, as before.
In [89]:
#To make the code reusable, let's wrap it in a function:
def scrapeData(ptags, omitFirstRow=True, saveFullText=True):
    #The code is pretty much as it was before...
    #...except that at first we create an empty dataframe
    df = pd.DataFrame()
    section = ''
    start = 1 if omitFirstRow else 0
    mpname = ptags[0].text
    #Iterate through the paragraphs, omitting the first one, which is the member name
    for p in ptags[start:]:
        #Check to see if we're in a new section. If so, capture the section
        if p.text and p.text.startswith(('1.','2.','3.','4.','5.','6.','7.','8.', '9.', '10.')):
            section = p.text
            continue
        #Do the preliminary parsing of a paragraph
        extracteditems = todict2(pattern, p.text)
        if extracteditems:
            #Identify the section
            extracteditems['section'] = section
            #Keep a record of the full, unparsed paragraph text if required
            if saveFullText: extracteditems['fulltext'] = p.text
            #Dates
            todict2(datepattern, extracteditems['txt'], extracteditems)
            #Get a simplified version of the text string, without dates, to potentially make life easier in the future
            extracteditems['cleanertxt'] = extracteditems['txt'].split('\nDate')[0]
            #Dateify any dates
            parsedates(extracteditems)
            #Extract any company numbers that are declared
            extracteditems.update( companynumber(extracteditems['status']) )
            #Money
            extracteditems.update( getcheckamounts( extracteditems['txt']) )
            #Add a column to say we have structurally parsed the row
            extracteditems['parsed'] = True
            #...and then we add each record to it
            df = pd.concat([df, pd.DataFrame([extracteditems])])
            continue
        #Visits outside UK
        extracteditems = getoutsideUKvisit(p.text)
        if extracteditems:
            #Identify the section
            extracteditems['section'] = section
            if saveFullText: extracteditems['fulltext'] = p.text
            #Add a column to say we have structurally parsed the row
            extracteditems['parsed'] = True
            #...and then we add each record to it
            df = pd.concat([df, pd.DataFrame([extracteditems])])
            continue
        #Employment
        extracteditems = getemploymentdetails(p.text)
        if extracteditems:
            #Identify the section
            extracteditems['section'] = section
            if saveFullText: extracteditems['fulltext'] = p.text
            #Add a column to say we have structurally parsed the row
            extracteditems['parsed'] = True
            #...and then we add each record to it
            df = pd.concat([df, pd.DataFrame([extracteditems])])
            continue
        #This is the catch all for unparsed rows that contain content
        if p.text:
            #Grab any amounts
            extracteditems = getcheckamounts( p.text )
            #Identify the section
            extracteditems['section'] = section
            #Keep a record of the text
            extracteditems['txt'] = p.text
            if saveFullText: extracteditems['fulltext'] = p.text
            #Record that we haven't run this through a structured parser
            extracteditems['parsed'] = False
            #Add to the list of entries
            df = pd.concat([df, pd.DataFrame([extracteditems])])
    #We need to record which MP it was...
    df['mpname'] = mpname
    #Return the dataframe... resetting the index for the whole dataframe
    #We can ignore (drop) the original index:
    # it contains "dummy" values created when the single row dataframe was made from each record
    return df.reset_index(drop=True)
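One aside: growing a dataframe by repeatedly calling pd.concat on single-row frames works, but it gets slow on big scrapes. A common alternative, sketched here with made-up records, is to accumulate the dicts in a list and build the dataframe once at the end:

```python
import pandas as pd

# Sketch: collect the record dicts in a list as we go...
records = []
for rec in [{'section': '1.', 'parsed': True}, {'section': '2.', 'parsed': False}]:
    records.append(rec)

# ...then build the dataframe in a single step at the end
df = pd.DataFrame(records)
print(df.shape)
```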
In [90]:
#Let's see if it works...
tmp = scrapeData(ptags)
tmp.head(), tmp.tail()
Out[90]:
If everything has gone to plan, we should be able to scrape the data from another page...
In [91]:
scrapeData(ptags2).head()
Out[91]:
In this case, we see how we can use the table as a whole to extract more information, as in the case of the employment details that were extracted as a nested list.
In [92]:
# It might be tidier to run this on the dataframe as we build it up,
#that is, as soon as we have processed the employment section, section 1.
def scrapeData2(ptags, omitFirstRow=True):
    df = scrapeData(ptags, omitFirstRow)
    #Postprocessor - add in the payment info
    if not df.empty and 'txt' in df.columns:
        df['txt4'] = df['txt']
        #Fill down within the employment section (section 1.)
        df['txt4'] = df.loc[df['section'].str.startswith('1.')]['txt4'].ffill()
        tmp = df['txt4'].apply(getemppaymentsubdetails).apply(pd.Series)
        #This is getting Frankensteinian now.... :-(
        if 'empfromsub' in tmp:
            tmp['empfromsubfilldown'] = tmp['empfromsub'].ffill()
        #The parsing can be a bit flaky, so we need to trap cases where we don't extract anything
        if not tmp.empty:
            df = df.join(tmp)
    return df

scrapeData2(ptags2)[['txt4', 'txt2', 'txt', 'empaddr', 'empfromsub']].head()
Out[92]:
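The "fill down" step in the postprocessor relies on pandas' forward-fill; a minimal sketch with hypothetical values shows the behaviour:

```python
import pandas as pd

# Hypothetical column where only the first row of each block carries the payer name
s = pd.Series(['ACME Ltd', None, None, 'Widgets plc', None])

# Forward-fill copies each value down into the gaps below it
print(s.ffill().tolist())
```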
As we scrape more pages, we may be able to identify more common structures in different sections of the register and construct additional structured parsers for them.
It would probably have made sense to have looked at the regulations before building the scraper to see if they describe required fields: House of Commons Guide to the Rules relating to the Conduct of Members. An FOI request about structured items might also be useful?
Now let's move on to stage 2 - getting a list of URLs for all the MPs so we can do a scrape of the whole register.
We can find the list of links to individual MPs on a register page such as https://publications.parliament.uk/pa/cm/cmregmem/170502/contents.htm.
Inspecting the page using developer tools again suggests that the mainTextBlock
is a good place to start. However, requests-html also provides a convenient property, r.html.absolute_links
, that simply grabs all the links on the page for us.
The URLs to the corresponding member pages have a similar path, such as:
Full register URL: https://publications.parliament.uk/pa/cm/cmregmem/170502/contents.htm
MP register URL: https://publications.parliament.uk/pa/cm/cmregmem/170502/dugher_michael.htm
This means that we can filter all the links on the page to just the ones that start with path to that register.
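We can check the filtering logic against a small, made-up set of links before running it for real (note that the contents.htm page itself also matches the prefix, so it survives the filter too):

```python
# Hypothetical set of links, of the kind returned by r.html.absolute_links
links = {
    'https://publications.parliament.uk/pa/cm/cmregmem/170502/dugher_michael.htm',
    'https://publications.parliament.uk/pa/cm/cmregmem/170502/contents.htm',
    'https://www.parliament.uk/business/news/some-other-page/',
}
registerPath = 'https://publications.parliament.uk/pa/cm/cmregmem/170502'

# Keep only the links under the register's own path
mplinks = [link for link in links if link.startswith(registerPath)]
print(len(mplinks))
```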
In [93]:
registerListurl = 'https://publications.parliament.uk/pa/cm/cmregmem/170502/contents.htm'
registerListurl = 'https://publications.parliament.uk/pa/cm/cmregmem/180305/contents.htm'
In [94]:
registerPath = '/'.join(registerListurl.split('/')[:-1])
registerPath
Out[94]:
In [95]:
r3 = session.get(registerListurl)
links = r3.html.absolute_links
mplinks = [link for link in links if link.startswith(registerPath)]
mplinks[:3]
Out[95]:
Now we can use this set of links to scrape the whole register. We do this by concatenating the separate dataframes scraped from each member's page into a single dataframe.
In case the big scrape throws up errors in the single page scraper, we can cache (that is, keep hold of a local copy of) the individual pages so that we don't have to keep hitting the Parliament website. It's also generally considered good practice to put in a small delay between page requests. Making this delay random means that the server we are hitting is perhaps less likely to detect we're scraping it.
We can use the requests-cache
package to cache page loads for a specified time, either in memory, or persistently to a sqlite database.
In [96]:
#!~/anaconda3/bin/pip install requests-cache
import requests_cache
requests_cache.install_cache()
The time
library lets us introduce a delay, in seconds. The random
library can generate this time for us.
In [97]:
import random
import time
delay = random.uniform(4.1,4.5)
print('Wait for {}s...'.format(delay))
time.sleep(delay)
print('...done')
In [98]:
df_full = pd.DataFrame()
for mpurl in mplinks:
    rr = session.get(mpurl)
    mptags = rr.html.find('#mainTextBlock > p')
    #print(mpurl)
    df_mp = scrapeData2(mptags)
    df_full = pd.concat([df_full, df_mp])
    #Be nice - although there should be no need for this if the cache is working, after the first run?
    #time.sleep(random.uniform(0.1,0.5))
In [99]:
dbname = 'mpregister.sqlite'
dbname = 'mpregisterLatest.sqlite'
In [100]:
#Remove any previous copy of the database; -f suppresses the error if the file doesn't exist
!rm -f {dbname}
In [101]:
import sqlite3
from pandas.io import sql
# Create a connection to the database
conn = sqlite3.connect(dbname)
In [102]:
#Save the df to the database
#To do this, we need to set date types to a standard text format
tablename = 'mpregfinint'
for c in df_full.columns:
    if c.endswith('_f'): df_full[c] = df_full[c].dt.strftime('%Y-%m-%d')
df_full.to_sql(tablename, conn, index=False, if_exists='replace')
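The date-to-text conversion used above can be seen in isolation on a tiny, made-up column:

```python
import pandas as pd

# Hypothetical datetime column of the kind produced by parsedates
df = pd.DataFrame({'regdate_f': pd.to_datetime(['2017-05-02', '2018-03-05'])})

# Datetime values are serialised as ISO-format text before writing to sqlite
df['regdate_f'] = df['regdate_f'].dt.strftime('%Y-%m-%d')
print(df['regdate_f'].tolist())
```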
In [ ]:
This notebook has walked through the creation of a recipe for scraping the contents of the register of financial interests for a single MP, and then applying it across all the members listed in a particular register.
In the next instalment, we'll look - possibly! - at how to query the data once we have scraped it, as well as how to enrich the dataset and make it "linkable" to other datasets using common identifiers, such as MNIS / MP ids and company numbers.